Load the college application data from Lab1 and create the variable Elite by binning the Top10perc variable. We are going to divide universities into two groups based on whether or not the proportion of students coming from the top 10% of their high school classes exceeds 50 %. We will also save the College names as a new variable and remove Accept and Enroll as temporally they occur after applying, and do not make sense as predictors in future data.
data(College)
College = College %>% mutate(college = rownames(College)) %>%
mutate(Elite = factor(Top10perc > 50)) %>%
mutate(Elite = recode(Elite, "TRUE" = "Yes", "FALSE" ="No")) %>%
dplyr::select(c(-Accept,-Enroll))
colnames(College)
## [1] "Private" "Apps" "Top10perc" "Top25perc" "F.Undergrad"
## [6] "P.Undergrad" "Outstate" "Room.Board" "Books" "Personal"
## [11] "PhD" "Terminal" "S.F.Ratio" "perc.alumni" "Expend"
## [16] "Grad.Rate" "college" "Elite"
We are going to create a training and test set by randomly splitting the data. First set a random seed by
# do not change this; for a break google `8675309`
set.seed(8675309)
n = nrow(College)
n.train = floor(.75*n)
train = sample(1:n, size=n.train, replace=FALSE)
College.train = College[train,]
College.test = College[-train,]
#summary for training set
summary(College.train[,-17])
## Private Apps Top10perc Top25perc
## No :174 Min. : 81.0 Min. : 1.00 Min. : 9.00
## Yes:408 1st Qu.: 804.2 1st Qu.:15.00 1st Qu.: 40.00
## Median : 1590.5 Median :23.00 Median : 54.00
## Mean : 2969.6 Mean :26.99 Mean : 55.24
## 3rd Qu.: 3753.5 3rd Qu.:35.00 3rd Qu.: 68.00
## Max. :20192.0 Max. :95.00 Max. :100.00
## F.Undergrad P.Undergrad Outstate Room.Board
## Min. : 139 Min. : 1.0 Min. : 2700 Min. :1880
## 1st Qu.: 1025 1st Qu.: 107.8 1st Qu.: 7036 1st Qu.:3587
## Median : 1773 Median : 372.0 Median : 9806 Median :4181
## Mean : 3735 Mean : 877.9 Mean :10255 Mean :4343
## 3rd Qu.: 4260 3rd Qu.: 1077.2 3rd Qu.:12722 3rd Qu.:5011
## Max. :31643 Max. :21836.0 Max. :21700 Max. :7425
## Books Personal PhD Terminal
## Min. : 110.0 Min. : 250 Min. : 8.00 Min. : 30.00
## 1st Qu.: 482.5 1st Qu.: 875 1st Qu.: 63.00 1st Qu.: 71.00
## Median : 525.0 Median :1200 Median : 76.00 Median : 83.00
## Mean : 555.6 Mean :1333 Mean : 72.97 Mean : 80.07
## 3rd Qu.: 600.0 3rd Qu.:1700 3rd Qu.: 85.00 3rd Qu.: 92.00
## Max. :2340.0 Max. :4913 Max. :103.00 Max. :100.00
## S.F.Ratio perc.alumni Expend Grad.Rate Elite
## Min. : 2.5 Min. : 0.00 Min. : 3365 Min. : 10.00 No :527
## 1st Qu.:11.7 1st Qu.:13.00 1st Qu.: 6720 1st Qu.: 53.00 Yes: 55
## Median :13.8 Median :20.00 Median : 8326 Median : 65.00
## Mean :14.3 Mean :22.22 Mean : 9455 Mean : 65.48
## 3rd Qu.:16.7 3rd Qu.:31.00 3rd Qu.:10648 3rd Qu.: 78.00
## Max. :39.8 Max. :64.00 Max. :56233 Max. :100.00
#summary for test set
summary(College.test[,-17])
## Private Apps Top10perc Top25perc
## No : 38 Min. : 152 Min. : 3.00 Min. : 19.00
## Yes:157 1st Qu.: 672 1st Qu.:17.00 1st Qu.: 44.50
## Median : 1496 Median :25.00 Median : 55.00
## Mean : 3097 Mean :29.25 Mean : 57.47
## 3rd Qu.: 3498 3rd Qu.:36.00 3rd Qu.: 69.00
## Max. :48094 Max. :96.00 Max. :100.00
## F.Undergrad P.Undergrad Outstate Room.Board
## Min. : 282.0 Min. : 7.0 Min. : 2340 Min. :1780
## 1st Qu.: 927.5 1st Qu.: 71.5 1st Qu.: 7825 1st Qu.:3606
## Median : 1506.0 Median : 334.0 Median :10658 Median :4220
## Mean : 3595.6 Mean : 787.9 Mean :10996 Mean :4401
## 3rd Qu.: 3496.0 3rd Qu.: 877.5 3rd Qu.:13298 3rd Qu.:5114
## Max. :27378.0 Max. :9054.0 Max. :20100 Max. :8124
## Books Personal PhD Terminal
## Min. : 96.0 Min. : 300 Min. :22.00 Min. : 24.00
## 1st Qu.: 450.0 1st Qu.: 850 1st Qu.:60.00 1st Qu.: 69.00
## Median : 500.0 Median :1235 Median :74.00 Median : 80.00
## Mean : 530.7 Mean :1364 Mean :71.74 Mean : 78.62
## 3rd Qu.: 600.0 3rd Qu.:1653 3rd Qu.:86.50 3rd Qu.: 92.50
## Max. :1000.0 Max. :6800 Max. :99.00 Max. :100.00
## S.F.Ratio perc.alumni Expend Grad.Rate
## Min. : 4.60 Min. : 0.00 Min. : 3186 Min. : 21.00
## 1st Qu.:11.20 1st Qu.:14.00 1st Qu.: 6898 1st Qu.: 53.00
## Median :13.00 Median :24.00 Median : 8575 Median : 66.00
## Mean :13.45 Mean :24.31 Mean :10272 Mean : 65.41
## 3rd Qu.:15.45 3rd Qu.:32.00 3rd Qu.:11045 3rd Qu.: 76.00
## Max. :27.60 Max. :60.00 Max. :42926 Max. :118.00
## Elite
## No :172
## Yes: 23
##
##
##
##
ANSWER: Catergorical variables: Private and Elite. College name should not be treated as a variable. All other variables left should be treated as numerical variables.
From the summaries given above, variables in both train set and test set have similar ranges and mean. One thing we need pay attention to is that the ratio of public school vs private school in training set is 0.43 while the ratio in test set is only 0.24. This may introduce bias since private school and public school have many differences.
Apps using the training data only. If you use pairs or preferably ggpairs make sure that Apps is on the y-axis in plots versus the other predictors. (Make sure that the plots are legible, which may require multiple plots.)#melt data
College.train_melt = College.train %>% select(-c(Private,college, Elite)) %>%
gather(feature, value, c(Top10perc:Grad.Rate))
College.train_melt_f = College.train %>% select(-college) %>%
select(c(Private,Apps,Elite)) %>%
gather(feature, value, c(Private,Elite))
College.train_melt_1 = College.train_melt[1:2328,]
College.train_melt_2 = College.train_melt[2329:4656,]
College.train_melt_3 = College.train_melt[4657:6984,]
College.train_melt_4 = College.train_melt[6985:8148,]
ggplot(College.train_melt_f, aes(value, Apps)) +
geom_point(color = "firebrick") +
facet_wrap(~feature, scales = "free")
ggplot(College.train_melt_1, aes(value, Apps)) +
geom_point(color = "firebrick") +
facet_wrap(~feature, scales = "free")
ggplot(College.train_melt_2, aes(value, Apps)) +
geom_point(color = "firebrick") +
facet_wrap(~feature, scales = "free")
ggplot(College.train_melt_3, aes(value, Apps)) +
geom_point(color = "firebrick") +
facet_wrap(~feature, scales = "free")
ggplot(College.train_melt_4, aes(value, Apps)) +
geom_point(color = "firebrick") +
facet_wrap(~feature, scales = "free")
ANSWER:We first looked at two categorical variables: Elite and Private. Under Yes or No, most colleges have applications less than 15000. So, we can consider applications greater than 20000 are outliers.
We then paid attention to numerical variables. We could see clearly that the number of full time undergraduate has strong linear relation with applications. Variables ‘Top10perc’ and ‘Top25perc’ have weak linear relationship with applications. For other numerical variables, there is no linear relation with applications. Outliers exist in all variables which indicate appropriate tranformation is needed.
Apps from the other predictors (without any transformations). Present model summaries and diagnostic plots. Based on diagnostic plots using residuals, comment on the adequacy of your model.app_lm = lm(Apps ~.-college, data = College.train)
summary(app_lm)
##
## Call:
## lm(formula = Apps ~ . - college, data = College.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5303.7 -698.7 -66.3 474.7 8310.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.863e+03 6.513e+02 -2.861 0.004386 **
## PrivateYes -5.926e+02 2.225e+02 -2.663 0.007961 **
## Top10perc -4.819e+00 1.103e+01 -0.437 0.662281
## Top25perc 1.024e+01 7.576e+00 1.351 0.177237
## F.Undergrad 6.239e-01 2.063e-02 30.245 < 2e-16 ***
## P.Undergrad -9.703e-02 4.991e-02 -1.944 0.052378 .
## Outstate 5.200e-02 3.040e-02 1.711 0.087666 .
## Room.Board 3.514e-01 7.928e-02 4.433 1.12e-05 ***
## Books 2.117e-01 3.673e-01 0.576 0.564636
## Personal -2.575e-01 1.081e-01 -2.381 0.017591 *
## PhD 5.883e+00 7.472e+00 0.787 0.431378
## Terminal -1.702e+01 8.432e+00 -2.019 0.044005 *
## S.F.Ratio 1.563e+01 2.077e+01 0.753 0.452010
## perc.alumni -2.286e+01 6.773e+00 -3.376 0.000787 ***
## Expend 8.686e-02 2.136e-02 4.067 5.43e-05 ***
## Grad.Rate 1.578e+01 4.762e+00 3.314 0.000980 ***
## EliteYes 1.033e+03 3.435e+02 3.008 0.002749 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1463 on 565 degrees of freedom
## Multiple R-squared: 0.8261, Adjusted R-squared: 0.8211
## F-statistic: 167.7 on 16 and 565 DF, p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(app_lm, labels.id = College.train$college)
ANSWER: From the residual plot vs fitted value, we see that the residual variance increases. The model predicts negative Apps, which is not meaningful. From the QQplot, there is clear heavy tail. From scale-location plot, we can see an increasing, linear trend of the standardized residuals vs fitted values. Therefore, we can conclude that a linear model without transformations and/or interactions is not adequate.
#rmse function
rmse = function(y, ypred){
rmse=sqrt(mean((y-ypred)^2))
return(rmse)
}
#number of replicates
nsim=1000
#Create design matrix
X = model.matrix(app_lm)
#Get 1000 samples of parameters
params = sim(app_lm, nsim)
#Create matrix to store 1000 sets of y's
y_rep = matrix(0, nrow = nsim ,ncol = nrow(College.train))
#Create vector to store 1000 RMSE statistic
RMSE.2<-rep(0,nsim)
#compute RMSE based on simulations
for(i in 1:nsim){
y_rep[i,] = X %*% params@coef[i,] + rnorm(1, 0, params@sigma[i])
RMSE.2[i]<-rmse(predict(app_lm), y_rep[i,])
}
RMSE.2 = data.frame(RMSE.2)
#compute RMSE based on train data
RMSE.1<-rmse(College.train$Apps,predict(app_lm))
#plot the histogram
GGPLOT.4<-ggplot(data= RMSE.2, aes(x=RMSE.2))+
geom_histogram() +
geom_vline(xintercept = RMSE.1, color="red")+labs(x="RMSE", y="Count")
plot(GGPLOT.4)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#compute p-value
pval.lm<-sum(RMSE.2>RMSE.1)/nsim
RMSE.1
## [1] 1441.111
print(pval.lm)
## [1] 0.315
ANSWER: The RMSE of the observed data is not inconsistent with the simulated model. In fact, the tail probability, or p-value, of ‘r pval’ for RMSE is too large to not reject this model (based on this particular statistic).
ANSWER: First we perform a log transformation on ‘Apps’. From the residual plot, we can see that the latter is much better than for the non-transformed model. We also applied the boxcox procedure to find the best y tranformation. And we can see 0 is close to the best parameter. Therefore, for a better intepratation of y, we admit log transformation on y. (More below)
#log transformation on y
app_lm_yt = lm(log(Apps)~.-college, data = College.train)
par(mfrow = c(2,2))
#diagnostic plots
plot(app_lm_yt)
#boxcox procedure to find best y transformation
boxcox(app_lm)
However, there is still a linear trend between residuals and fitted value based on the residual plots. We next move to x transformations. Let us go back to the scatter plots in question 2. As we have already seen, data is not evenly distributed. We applied log transformation on x which have many leverage points. For ‘Top25perc’ and ‘Grad.Rate’, we applied quadratic transformation on x since there is quadartic trend. (More below)
#log transformation on y
app_lm_t = lm(log(Apps) ~ Private + log(Top10perc) + (Top25perc)^2 +
log(F.Undergrad) +log(P.Undergrad) + Outstate +
(Room.Board) + log(Books) + log(Personal) +
log(PhD)+Terminal+log(S.F.Ratio)+ perc.alumni +
log(Expend) + (Grad.Rate)^2+ Elite,
data=College.train)
summary(app_lm_t)
##
## Call:
## lm(formula = log(Apps) ~ Private + log(Top10perc) + (Top25perc)^2 +
## log(F.Undergrad) + log(P.Undergrad) + Outstate + (Room.Board) +
## log(Books) + log(Personal) + log(PhD) + Terminal + log(S.F.Ratio) +
## perc.alumni + log(Expend) + (Grad.Rate)^2 + Elite, data = College.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.46249 -0.22772 0.02455 0.21071 2.10428
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.568e+00 9.422e-01 -2.726 0.006612 **
## PrivateYes 5.431e-03 6.293e-02 0.086 0.931257
## log(Top10perc) -1.976e-01 5.347e-02 -3.696 0.000241 ***
## Top25perc 5.679e-03 1.952e-03 2.909 0.003767 **
## log(F.Undergrad) 9.719e-01 2.951e-02 32.935 < 2e-16 ***
## log(P.Undergrad) -5.516e-02 1.474e-02 -3.743 0.000200 ***
## Outstate 2.154e-05 8.322e-06 2.589 0.009884 **
## Room.Board 8.466e-05 2.169e-05 3.904 0.000106 ***
## log(Books) 1.444e-02 6.347e-02 0.228 0.820104
## log(Personal) -2.183e-02 3.784e-02 -0.577 0.564262
## log(PhD) 2.419e-01 8.760e-02 2.761 0.005941 **
## Terminal -1.821e-03 1.933e-03 -0.942 0.346545
## log(S.F.Ratio) 5.713e-02 8.130e-02 0.703 0.482511
## perc.alumni -3.655e-04 1.797e-03 -0.203 0.838929
## log(Expend) 1.546e-01 8.481e-02 1.823 0.068760 .
## Grad.Rate 3.145e-03 1.245e-03 2.525 0.011838 *
## EliteYes 1.608e-01 7.005e-02 2.296 0.022049 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3795 on 565 degrees of freedom
## Multiple R-squared: 0.8756, Adjusted R-squared: 0.8721
## F-statistic: 248.6 on 16 and 565 DF, p-value: < 2.2e-16
#diagnostic plots
par(mfrow = c(2,2))
plot(app_lm_t)
Finally, we applied step function(BIC) to see if there are potential interation terms. From the plots, we can see, both residual plots and QQplot are much better than before. There, we finally choose the model given below are the best linear model for this data.
# Systematic check for interactions
app_lm_t2<-lm(log(Apps) ~ Private + log(Top10perc) + (Top25perc)^2 +
log(F.Undergrad) + log(P.Undergrad) + Outstate +
(Room.Board) + log(Books) + log(Personal) +
log(PhD)+Terminal+log(S.F.Ratio)+ perc.alumni +
log(Expend) + Grad.Rate+ (Elite)^2,
data=College.train)
#our best linear model using BIC
app_best = step(app_lm_t2, k=log(nrow(College.train)),trace=0)
summary(app_best)
##
## Call:
## lm(formula = log(Apps) ~ log(Top10perc) + Top25perc + log(F.Undergrad) +
## log(P.Undergrad) + Outstate + Room.Board + log(PhD) + Elite,
## data = College.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.52512 -0.23310 0.02069 0.22267 2.09038
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.086e+00 2.559e-01 -4.244 2.56e-05 ***
## log(Top10perc) -1.732e-01 5.089e-02 -3.404 0.00071 ***
## Top25perc 5.779e-03 1.881e-03 3.073 0.00222 **
## log(F.Undergrad) 9.778e-01 2.288e-02 42.736 < 2e-16 ***
## log(P.Undergrad) -6.152e-02 1.422e-02 -4.327 1.79e-05 ***
## Outstate 3.134e-05 6.572e-06 4.769 2.35e-06 ***
## Room.Board 9.707e-05 2.051e-05 4.734 2.78e-06 ***
## log(PhD) 1.986e-01 6.711e-02 2.959 0.00321 **
## EliteYes 1.828e-01 6.830e-02 2.677 0.00765 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3802 on 573 degrees of freedom
## Multiple R-squared: 0.8734, Adjusted R-squared: 0.8716
## F-statistic: 493.9 on 8 and 573 DF, p-value: < 2.2e-16
#diagnostic plots
par(mfrow=c(2,2))
plot(app_best,labels.id = College.train$college)
# Termplots for further diagnostics
par(mfrow=c(2,2))
termplot(app_best, data=College.train,partial.resid=T, smooth=panel.smooth, se=T)
nsim=1000
rmse = function(y, ypred){
rmse=sqrt(mean((y-ypred)^2))
return(rmse)
}
X<-model.matrix(app_best,
data = College.train)
params<-sim(app_best, nsim)
Y<-matrix(0, nrow=nrow(College.train),ncol=nsim )
RMSE.2<-rep(0,nsim)
# note that we had log-transformed the response; hence we need to exponentiate the simulated responses
for(j in 1:nsim){
Y[,j]<-exp(X %*% params@coef[j,]+ rnorm(1, 0, params@sigma[j]))
RMSE.2[j]<-rmse(Y[,j],exp(predict(app_best)))
}
RMSE.2<-as.data.frame(RMSE.2)
RMSE.1.bestlm<-rmse(College.train$Apps,exp(predict(app_best)))
GGPLOT.6<-ggplot(data= RMSE.2, aes(x=RMSE.2))+geom_histogram() + geom_vline(xintercept = RMSE.1.bestlm, col="red")+labs(x="RMSE", y="Count")
print(GGPLOT.6)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#pvalue
pval.best<-sum(RMSE.2>RMSE.1)/nsim
print(pval.best)
## [1] 0.386
ANSWER: We find that the observed value is closer to the mode of the simulated RMSE distribution. As indicated by the current p-value of ‘r pval’, the current model is a better fit to the observed data in the training set. Of course, this is based on a single summary statistic only (RMSE), and cannot be used for final approval of the model.
College.test. Plot the predicted residuals \(y_i - \hat{y}_i\) versus the predictions. Are there any cases where the model does a poor job of predicting? Compute the RMSE using the test data where now RMSE = \(\sqrt{\sum_{i = 1}^{n.test}(y_i - \hat{y}_i)^2/n.test}\) where the sum is over the test data. Which model is better for the out of sample prediction?ANSWER: First we plot the predictions for the base model with main effects only, and the ‘best’ model obtained by BIC. We found that a few predictions have an excessively high residuals, so we removed those from the plot.
# rerun the base model
apps.lm0<-lm(Apps~., data=College.train[,-17])
a.new<-predict(apps.lm0, newdata = College.test[,-17])
b.new<-predict(app_best, newdata=College.test)
# remove those with residual > 8000
ind.a<-which(abs(a.new-College.test$Apps)>8000)
ind.b<-which(abs(exp(b.new)-College.test$Apps)>8000)
par(mfrow=c(1,2))
plot(a.new[-ind.a],a.new[-ind.a]-College.test$Apps[-ind.a],xlab="Predicted", ylab="Residual",main ="Base model (#4)")
plot(exp(b.new[-ind.b]), exp(b.new[-ind.b])-College.test$Apps[-ind.b],xlab="Predicted", ylab="Residual",main="Best model (#6)")
# colleges with extremely high number of applications
print(College$college[ind.a])
## [1] "Antioch University" "College of Saint Catherine"
print(College$college[ind.b])
## [1] "College of Saint Catherine"
The worst fits (residual >8000) are for “Antioch University” and “College of Saint Catherine” for the base model, and “College of Saint Catherine” and “Dordt College” for the ‘best’ linear model. Next, we compute the RMSE of the test data.
RMSE.0.test<-sqrt(mean((a.new-College.test$Apps)^2))
RMSE.best.test<-sqrt(mean((exp(b.new)-(College.test$Apps))^2))
print(RMSE.0.test)
## [1] 2894.803
print(RMSE.best.test)
## [1] 2781.898
With an RMSE of ‘r RMSE.best.test’ the out of sample prediction is better for the model with interactions and transformations compared to the base model which has an RMSE of ‘r RMSE.0.test’.
par(mfrow=c(1,2))
plot(GGPLOT.4+geom_vline(xintercept = RMSE.0.test, col="yellow"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
plot(GGPLOT.6+ geom_vline(xintercept = RMSE.best.test, col="yellow"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
pval.1<-sum(RMSE.0.test>RMSE.2)/nsim
print(pval.1)
## [1] 0.904
pval.2<-sum(RMSE.best.test>RMSE.2)/nsim
print(pval.2)
## [1] 0.899
ANSWER: The p-values of the test data are ‘r pval.1’ and ‘r pval.2’, respectively. This means if the model were true, the observed test RMSE’s are quite unlikely. This implies that we cannot accurately predict the left out data with this model.
Apps.poi = glm(Apps ~ . -college, family = poisson(link = "log"), data = College.train)
summary(Apps.poi)
##
## Call:
## glm(formula = Apps ~ . - college, family = poisson(link = "log"),
## data = College.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -84.601 -18.714 -6.817 9.334 105.987
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.366e+00 9.451e-03 567.77 <2e-16 ***
## PrivateYes -7.638e-01 2.908e-03 -262.61 <2e-16 ***
## Top10perc -5.634e-03 1.095e-04 -51.46 <2e-16 ***
## Top25perc 4.267e-03 8.845e-05 48.24 <2e-16 ***
## F.Undergrad 7.523e-05 1.667e-07 451.16 <2e-16 ***
## P.Undergrad 2.860e-05 3.947e-07 72.48 <2e-16 ***
## Outstate 4.686e-05 3.808e-07 123.05 <2e-16 ***
## Room.Board 1.173e-04 1.014e-06 115.72 <2e-16 ***
## Books 3.013e-04 4.777e-06 63.07 <2e-16 ***
## Personal -6.213e-05 1.475e-06 -42.12 <2e-16 ***
## PhD 8.270e-03 1.225e-04 67.53 <2e-16 ***
## Terminal -2.724e-03 1.391e-04 -19.59 <2e-16 ***
## S.F.Ratio 1.909e-02 2.528e-04 75.51 <2e-16 ***
## perc.alumni -1.120e-02 9.122e-05 -122.82 <2e-16 ***
## Expend 2.491e-05 1.875e-07 132.83 <2e-16 ***
## Grad.Rate 1.060e-02 7.035e-05 150.62 <2e-16 ***
## EliteYes 1.947e-01 3.809e-03 51.11 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 1753917 on 581 degrees of freedom
## Residual deviance: 386292 on 565 degrees of freedom
## AIC: 391729
##
## Number of Fisher Scoring iterations: 5
par(mfrow=c(2,2))
plot(Apps.poi)
#checking for overdispersion
Apps.poi$deviance/Apps.poi$df.residual
## [1] 683.7027
ANSWER: Due to the inflexible variance structure (variance=mean) of the Poisson, we find that all predictors are highly significant. The residual deviance is very large, and the ratio of residual deviance divided by the residual degrees of freedom is 683 which is \(\qq 1\), indicating substantial overdispersion (no need for a formal test here!). In the diagnostic plots, we observe that some of the standard deviance residuals are huge - indicating cause for concern. In summary, the overdispersion indicates a lack of fit.
ANSWER: The RMSE of the model is completely out of whack compared to the replicated data (p value is estiamted to be 0 based on 1000 simulations). In conculsion, based on the RMSE as a test statistic, we have very strong evidence to reject the Poisson model.
S = 1000
beta = sim(Apps.poi, n.sims = 1000)
X = model.matrix(Apps.poi)
Y = matrix(NA, nrow(College.train), ncol = S)
rmse.train.poi = rmse(predict(Apps.poi), College.train$Apps)
rmse.rep.poi = numeric(S)
for (s in 1:S) {
lambda = exp(X %*% beta@coef[s,])
Y[, s] = rpois(nrow(College.train),lambda)
rmse.rep.poi[s] = rmse(predict(Apps.poi), Y[, s])
}
rmse.rep.poi = as.data.frame(rmse.rep.poi)
hist.poi <-ggplot(data= rmse.rep.poi, aes(x=rmse.rep.poi)) + geom_histogram(fill = "slateblue") + geom_vline(xintercept = rmse.train.poi, col="red")
print(hist.poi)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
rmse.train.poi
## [1] 4550.667
pval.poi <-sum(rmse.rep.poi > rmse.train.poi)/S
print(pval.poi)
## [1] 0
ANSWER: Below, we show the simulated RMSE, together with the training RMSE (red) and the test RMSE (blue) – we see that the latter is much larger than the training RMSE (not surprising).
Apps.poi = glm(Apps ~ . , family = poisson(link = "log"), data = College.train[,-17])
rmse.test = rmse(College.test$Apps, predict(Apps.poi, newdata=College.test[,-17]))
hist.poi <- hist.poi + geom_vline(xintercept = rmse.test, colour = "cornflowerblue")
rmse.test
## [1] 5787.812
plot(hist.poi)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ANSWER: Given the overdispersion, a first step is to add a variable dispersion parameter, i.e., use a quasi-Poisson model. Of course, this will not change the deviance as the likelihood remains the same, but it will increase the standard errors and give a better idea of which predictors are important. (See explanation of plots below)
# Quasi-Poisson
Apps.qpoi.1 <- glm(Apps ~ . , family = quasipoisson(link = "log"), data = College.train[,-17])
summary(Apps.qpoi.1)
##
## Call:
## glm(formula = Apps ~ ., family = quasipoisson(link = "log"),
## data = College.train[, -17])
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -84.601 -18.714 -6.817 9.334 105.987
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.366e+00 2.572e-01 20.867 < 2e-16 ***
## PrivateYes -7.638e-01 7.914e-02 -9.651 < 2e-16 ***
## Top10perc -5.634e-03 2.979e-03 -1.891 0.05911 .
## Top25perc 4.267e-03 2.407e-03 1.773 0.07678 .
## F.Undergrad 7.522e-05 4.537e-06 16.581 < 2e-16 ***
## P.Undergrad 2.860e-05 1.074e-05 2.664 0.00795 **
## Outstate 4.686e-05 1.036e-05 4.522 7.46e-06 ***
## Room.Board 1.173e-04 2.759e-05 4.253 2.47e-05 ***
## Books 3.013e-04 1.300e-04 2.318 0.02081 *
## Personal -6.213e-05 4.013e-05 -1.548 0.12214
## PhD 8.270e-03 3.332e-03 2.482 0.01336 *
## Terminal -2.724e-03 3.784e-03 -0.720 0.47184
## S.F.Ratio 1.909e-02 6.878e-03 2.775 0.00570 **
## perc.alumni -1.120e-02 2.482e-03 -4.514 7.75e-06 ***
## Expend 2.491e-05 5.102e-06 4.882 1.37e-06 ***
## Grad.Rate 1.060e-02 1.914e-03 5.536 4.75e-08 ***
## EliteYes 1.947e-01 1.036e-01 1.878 0.06084 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasipoisson family taken to be 740.3351)
##
## Null deviance: 1753917 on 581 degrees of freedom
## Residual deviance: 386292 on 565 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 5
par(mfrow=c(2,2))
plot(Apps.qpoi.1)
Looking at the diagnostic plots, we find that using the overdispersed Poisson helped substantially. It brought the standard deviance residual from a max of an absolute value of 100 to a max of a little less than 4, and a max Std Pearson residual from 150 to less than 5. There is still a trend in the Scale-Location plot.
Next, we used the predictor transformations and interactions from the best linear model to see if this would resolve some of these issues (note we cannot use the step procedure for the QP because there is no explicit likelihood for this model). As shown below, we found that the pattern in the residual plot is attenuated, and the trend in the std deviance residual plot is lower than in the previous model. The overdispersion has been lowered by accounting for the additional terms, but there is still a lack of fit as indicated by the ratio of the residual deviance by the residual degrees of freedom which is ~284.
# Quasi-Poisson with transformations and interactions
Apps.qpoi.2 <- glm(Apps ~ Private + log(Top10perc) + log(Top25perc) +
log(F.Undergrad) + log(P.Undergrad) + Outstate + Room.Board +
log(PhD) + log(S.F.Ratio) + log(Expend) + log(Grad.Rate) +
Private:Room.Board + log(Top10perc):log(Top25perc) + log(P.Undergrad):Outstate +
Outstate:log(Expend) + Room.Board:log(Expend) + log(S.F.Ratio):log(Grad.Rate) +
log(Expend):log(Grad.Rate), family = quasipoisson(link = "log"), data = College.train)
summary(Apps.qpoi.2)
##
## Call:
## glm(formula = Apps ~ Private + log(Top10perc) + log(Top25perc) +
## log(F.Undergrad) + log(P.Undergrad) + Outstate + Room.Board +
## log(PhD) + log(S.F.Ratio) + log(Expend) + log(Grad.Rate) +
## Private:Room.Board + log(Top10perc):log(Top25perc) + log(P.Undergrad):Outstate +
## Outstate:log(Expend) + Room.Board:log(Expend) + log(S.F.Ratio):log(Grad.Rate) +
## log(Expend):log(Grad.Rate), family = quasipoisson(link = "log"),
## data = College.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -69.101 -11.045 -2.548 6.880 98.017
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.034e+01 8.595e+00 2.366 0.018312 *
## PrivateYes 5.463e-02 1.637e-01 0.334 0.738760
## log(Top10perc) -6.256e-01 1.816e-01 -3.445 0.000614 ***
## log(Top25perc) -2.330e-01 1.477e-01 -1.577 0.115423
## log(F.Undergrad) 9.019e-01 2.790e-02 32.324 < 2e-16 ***
## log(P.Undergrad) -1.564e-01 3.286e-02 -4.759 2.48e-06 ***
## Outstate -3.937e-04 1.302e-04 -3.022 0.002622 **
## Room.Board 1.617e-03 4.582e-04 3.530 0.000450 ***
## log(PhD) 1.285e-01 9.028e-02 1.424 0.155065
## log(S.F.Ratio) -1.487e+00 1.067e+00 -1.394 0.164014
## log(Expend) -1.796e+00 7.301e-01 -2.460 0.014174 *
## log(Grad.Rate) -5.536e+00 2.116e+00 -2.617 0.009111 **
## PrivateYes:Room.Board -4.621e-05 3.383e-05 -1.366 0.172557
## log(Top10perc):log(Top25perc) 1.347e-01 4.390e-02 3.069 0.002248 **
## log(P.Undergrad):Outstate 1.005e-05 2.270e-06 4.427 1.15e-05 ***
## Outstate:log(Expend) 3.899e-05 1.371e-05 2.843 0.004628 **
## Room.Board:log(Expend) -1.629e-04 5.068e-05 -3.214 0.001382 **
## log(S.F.Ratio):log(Grad.Rate) 3.697e-01 2.538e-01 1.457 0.145699
## log(Expend):log(Grad.Rate) 5.388e-01 1.813e-01 2.971 0.003090 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasipoisson family taken to be 320.3511)
##
## Null deviance: 1753917 on 581 degrees of freedom
## Residual deviance: 160007 on 563 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 4
par(mfrow=c(2,2))
plot(Apps.qpoi.2)
Apps.qpoi.2$deviance/Apps.qpoi.2$df.residual
## [1] 284.2047
pchisq(Apps.qpoi.2$deviance, Apps.qpoi.2$df.residual, lower.tail = F)
## [1] 0
# Choose the model
model<-Apps.qpoi.2 # .1 for untransformed predictors, .2 for transformed predictors
# Simulate 1000 replicates
S = 1000
n<-nrow(College.train)
qp.sim = sim(model, n.sims = 1000)
# Generate the design matrix
X.q = model.matrix(model, data = College.train)
# Allocate the output
Y.q = matrix(NA, nrow(College.train), ncol = S)
rmse.rep.qpoi = rep(0, S)
# Compute the rmse of the training data
rmse.qpoi.train = rmse(predict(model, newdata=College.train), College.train$Apps)
# Compute the rmse of the test data
rmse.qpoi.test = rmse(predict(model, newdata=College.test), College.test$Apps)
# The dispersion prameter
d<-(qp.sim@sigma[1])^2
for (s in 1:S) {
# generate the mu's
mu<-exp(X.q %*% qp.sim@coef[s,])
# generate the response realization
Y.q[, s] = rnegbin(n, mu=mu, theta=mu/(d-1))
# calculate the RMSE
rmse.rep.qpoi[s] = rmse(predict(model), Y.q[, s] )
}
RMSE.rep.qp= data.frame(rmse.rep.qpoi)
#plot the histogram
GGPLOT.13<-ggplot(data= RMSE.rep.qp, aes(x=RMSE.rep.qp))+
geom_histogram(fill = "slateblue") +
geom_vline(xintercept = rmse.qpoi.train, color="red")
GGPLOT.13.2<-GGPLOT.13+geom_vline(xintercept = rmse.qpoi.test, col="cornflowerblue") + xlab("RMSE")+ ylab("Count")
plot(GGPLOT.13.2)
## Don't know how to automatically pick scale for object of type data.frame. Defaulting to continuous.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#compute p-value
pval.train<-sum(RMSE.rep.qp>rmse.qpoi.train)/nsim
print(pval.train)
## [1] 0.668
pval.test<-sum(RMSE.rep.qp>rmse.qpoi.test)/nsim
print(pval.test)
## [1] 0
ANSWER: Accounting for dispersion leads to a good fit to the training data based on the RMSE statistic. The corresponding p-value is ‘r pval.train’ and indicates that there is no evidence to reject the model. For the test data however, we are doing very poorly as can be seen visually and the p-value of 0.
ANSWER: First we consider the untransformed model as follows.
Apps.nb <- glm.nb(Apps ~ . -college, data = College.train, link = log)
summary(Apps.nb)
##
## Call:
## glm.nb(formula = Apps ~ . - college, data = College.train, link = log,
## init.theta = 3.652914483)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.3065 -0.7978 -0.2008 0.5036 4.1839
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.731e+00 2.333e-01 20.274 < 2e-16 ***
## PrivateYes -4.762e-01 7.970e-02 -5.975 2.31e-09 ***
## Top10perc -5.152e-03 3.948e-03 -1.305 0.191948
## Top25perc 6.483e-03 2.713e-03 2.389 0.016874 *
## F.Undergrad 1.272e-04 7.382e-06 17.237 < 2e-16 ***
## P.Undergrad 4.474e-06 1.786e-05 0.250 0.802220
## Outstate 3.537e-05 1.089e-05 3.250 0.001156 **
## Room.Board 1.159e-04 2.839e-05 4.084 4.43e-05 ***
## Books 3.660e-04 1.315e-04 2.783 0.005391 **
## Personal -3.174e-05 3.874e-05 -0.819 0.412604
## PhD 1.003e-02 2.678e-03 3.745 0.000181 ***
## Terminal -4.722e-03 3.022e-03 -1.563 0.118089
## S.F.Ratio 3.796e-02 7.438e-03 5.103 3.34e-07 ***
## perc.alumni -7.534e-03 2.426e-03 -3.105 0.001902 **
## Expend 2.511e-05 7.644e-06 3.285 0.001019 **
## Grad.Rate 7.310e-03 1.706e-03 4.285 1.83e-05 ***
## EliteYes 2.203e-01 1.230e-01 1.791 0.073259 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(3.6529) family taken to be 1)
##
## Null deviance: 2339.63 on 581 degrees of freedom
## Residual deviance: 608.44 on 565 degrees of freedom
## AIC: 9654.4
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 3.653
## Std. Err.: 0.206
##
## 2 x log-likelihood: -9618.373
par(mfrow = c(2, 2))
plot(Apps.nb)
Apps.nb$deviance/Apps.nb$df.residual
## [1] 1.076887
pchisq(Apps.nb$deviance, Apps.nb$df.residual, lower.tail = F)
## [1] 0.1002212
The dispersion parameter for this model is quite small (3.6). Looking at the residual plot however we see a strong shape in the first subplot (residuals), and a trend in the third subplot (std. deviance residuals). In an attempt to improve things, we explored using hte same transformations as in the linear model, as follows:
Apps.nb <- glm.nb(Apps ~ Private + log(Top10perc) + log(Top25perc) +
log(F.Undergrad) + log(P.Undergrad) + Outstate + Room.Board +
log(PhD) + log(S.F.Ratio) + log(Expend) + log(Grad.Rate) +
Private:Room.Board + log(Top10perc):log(Top25perc) + log(P.Undergrad):Outstate +
Outstate:log(Expend) + Room.Board:log(Expend) + log(S.F.Ratio):log(Grad.Rate) +
log(Expend):log(Grad.Rate), data = College.train, link = log)
summary(Apps.nb)
##
## Call:
## glm.nb(formula = Apps ~ Private + log(Top10perc) + log(Top25perc) +
## log(F.Undergrad) + log(P.Undergrad) + Outstate + Room.Board +
## log(PhD) + log(S.F.Ratio) + log(Expend) + log(Grad.Rate) +
## Private:Room.Board + log(Top10perc):log(Top25perc) + log(P.Undergrad):Outstate +
## Outstate:log(Expend) + Room.Board:log(Expend) + log(S.F.Ratio):log(Grad.Rate) +
## log(Expend):log(Grad.Rate), data = College.train, link = log,
## init.theta = 7.93895779)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.4634 -0.7555 -0.0975 0.5061 7.3945
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.504e+01 8.697e+00 4.028 5.61e-05 ***
## PrivateYes 6.453e-01 1.702e-01 3.790 0.000150 ***
## log(Top10perc) -7.300e-01 1.715e-01 -4.258 2.07e-05 ***
## log(Top25perc) -4.337e-01 1.483e-01 -2.924 0.003452 **
## log(F.Undergrad) 9.547e-01 2.719e-02 35.113 < 2e-16 ***
## log(P.Undergrad) -1.593e-01 3.264e-02 -4.882 1.05e-06 ***
## Outstate -5.907e-04 1.488e-04 -3.971 7.15e-05 ***
## Room.Board 1.769e-03 4.907e-04 3.606 0.000311 ***
## log(PhD) 1.915e-01 6.568e-02 2.916 0.003542 **
## log(S.F.Ratio) -3.713e+00 1.049e+00 -3.541 0.000399 ***
## log(Expend) -2.761e+00 7.448e-01 -3.707 0.000209 ***
## log(Grad.Rate) -8.864e+00 2.170e+00 -4.084 4.42e-05 ***
## PrivateYes:Room.Board -1.621e-04 4.053e-05 -3.999 6.35e-05 ***
## log(Top10perc):log(Top25perc) 1.790e-01 4.374e-02 4.093 4.26e-05 ***
## log(P.Undergrad):Outstate 1.051e-05 2.481e-06 4.238 2.26e-05 ***
## Outstate:log(Expend) 6.043e-05 1.568e-05 3.854 0.000116 ***
## Room.Board:log(Expend) -1.716e-04 5.472e-05 -3.136 0.001713 **
## log(S.F.Ratio):log(Grad.Rate) 9.294e-01 2.543e-01 3.655 0.000258 ***
## log(Expend):log(Grad.Rate) 7.280e-01 1.882e-01 3.868 0.000110 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(7.939) family taken to be 1)
##
## Null deviance: 5072.09 on 581 degrees of freedom
## Residual deviance: 594.69 on 563 degrees of freedom
## AIC: 9181
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 7.939
## Std. Err.: 0.460
##
## 2 x log-likelihood: -9140.991
par(mfrow = c(2, 2))
plot(Apps.nb, labels.id = College.train$college)
Apps.nb$deviance/Apps.nb$df.residual
## [1] 1.056292
p.val.22<-pchisq(Apps.nb$deviance, Apps.nb$df.residual, lower.tail = F)
After adding in the transformations we seen an improvement in the residual and std deviance residual plots. The p-value for overdispersion is now at ‘r p.val.22’, indicating that we cannot reject this model. The remaining cause for concern is Talladega College with a high residual and std deviance residual - this college is an outlier, but since it does not have high leverage and a Cooks’ distance below .5 we are not worried.
S = 1000
class(Apps.nb) = "glm"
beta.nb = sim(Apps.nb, n.sims = 1000)
X.nb = model.matrix(Apps.nb)
Y.nb = matrix(NA, nrow(College.train), ncol = S)
beta.nb@sigma = rnorm(S, Apps.nb$theta, Apps.nb$SE.theta)
rmse.train.nb = rmse(predict(Apps.nb), College.train$Apps)
rmse.test.nb = rmse(predict(Apps.nb, newdata = College.test), College.test$Apps)
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : calling predict.lm(<fake-lm-object>) ...
rmse.rep.nb = rep(0, S)
rmse.test.nb
## [1] 5787.761
for (s in 1:S) {
mu = exp(X.nb %*% beta.nb@coef[s, ])
Y.nb[,s] = rnegbin(n, mu=mu, theta=beta.nb@sigma[s])
rmse.rep.nb[s] = rmse(predict(Apps.nb), Y.nb[,s])
}
rmse.nb.df = data.frame(rsme.test.nb = rmse.test.nb, rmse.rep.nb = rmse.rep.nb)
rmse.rep.nb = as.data.frame(rmse.rep.nb)
hist.nb <-ggplot(data= rmse.rep.nb, aes(x=rmse.rep.nb)) + geom_histogram(fill = "slateblue") + geom_vline(xintercept = rmse.train.nb, color = "red") + geom_vline(xintercept = rmse.test.nb, col="cornflowerblue") + xlab("RMSE") + ylab("Count")
print(hist.nb)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
pval.nb <-sum(rmse.rep.nb > rmse.test.nb)/S
print(pval.nb)
## [1] 0.04
ANSWER: From the histogram, it is clear that the Negative Binomial helps to explain more of the variation in the data than the Poisson or Quasi-Poisson. Judging from the small RMSE for the training data, we may be overfitting the data. The RMSE of the test set is closer to the mean of the simulations than those of the former, but the p-value is small at ‘r pval’, indicating that we should be careful when using the model for prediction purposes!
#get prediction interval directly for linear regression models
pi.lm = predict.lm(object = apps.lm0,newdata =
College.test[,-17],se.fit=TRUE,interval="prediction",level=0.95)$fit[,2:3]
pi.best = exp(predict.lm(object = app_best,newdata =
College.test[,-17],se.fit=TRUE,interval="prediction",level=0.95)$fit[,2:3])
#get prediction interval for poisson regression with simulation
S = 1000
beta = sim(Apps.poi, n.sims = 1000)
X = model.matrix(Apps.poi,data=College.test[,-17])
Y = matrix(NA, nrow(College.test), ncol = S)
for (s in 1:S) {
lambda = exp(X %*% beta@coef[s,])
Y[, s] = rpois(nrow(College.test),lambda)
}
pi.poi = t(apply(Y, 1, function(x) {quantile(x, c((1 - .95)/2,.5 + .95/2))}))
#prediction interval for quasipoisson
# Simulate 1000 replicates
S = 1000
n<-nrow(College.test)
qp.sim = sim(Apps.qpoi.2, n.sims = 1000)
# Generate the design matrix
X.q = model.matrix(Apps.qpoi.2, data = College.test[,-17])
# Allocate the output
Y.q = matrix(NA, nrow(College.test[,-17]), ncol = S)
# ????? should be the square, correct????
d<-(qp.sim@sigma[1])^2
for (s in 1:S) {
# generate the mu's
mu<-exp(X.q %*% qp.sim@coef[s,])
# generate the response realization
Y.q[, s] = rnegbin(n, mu=mu, theta=mu/(d-1))
}
pi.qpoi.2 = t(apply(Y.q, 1, function(x) {quantile(x, c((1 - .95)/2,.5 + .95/2))}))
#prediction interval for negative binomial
nsim=1000
require(mvtnorm)
n = nrow(College.test)
X = model.matrix(Apps.nb, data=College.test[,-17])
class(Apps.nb)<-"glm"
sim.nb<-sim(Apps.nb, nsim)
sim.nb@sigma<-rnorm(nsim, Apps.nb$theta, Apps.nb$SE.theta)
#beta = rmvnorm(nsim, coef(Apps.nb), vcov(Apps.nb))
#theta = rnorm(nsim, Apps.nb$theta, Apps.nb$SE.theta)
y.rep = matrix(NA, nsim, n)
for (i in 1:nsim) {
mu = exp(X %*% sim.nb@coef[i,])
y.rep[i,] = rnegbin(n, mu=mu, theta=sim.nb@sigma[i])
}
pi.nb = t(apply(y.rep, 2, function(x) {quantile(x, c((1 - .95)/2,.5 + .95/2))}))
coverage = function(mod,test.df){
y = test.df[,"Apps"]
if(identical(mod,apps.lm0)){
return(mean(y >= pi.lm[,1] & y <= pi.lm[,2]))
}
if(identical(mod,app_best)){
return(mean(y >= pi.best[,1] & y <= pi.best[,2]))
}
if(identical(mod,Apps.poi)){
return(mean(y >= pi.poi[,1] & y <= pi.poi[,2]))
}
if(identical(mod,Apps.qpoi.2)){
return(mean(y >= pi.qpoi.2[,1] & y <= pi.qpoi.2[,2]))
}
if(identical(mod,Apps.nb)){
return(mean(y >= pi.nb[,1] & y <= pi.nb[,2]))
}
}
# coverage for each of the 5 models that you fit
a = coverage(apps.lm0,College.test[,-17])
b =coverage(app_best,College.test[,-17])
c =coverage(Apps.poi,College.test[,-17])
d =coverage(Apps.qpoi.2,College.test[,-17])
e =coverage(Apps.nb,College.test[,-17])
a;b;c;d;e
## [1] 0.9589744
## [1] 0.9487179
## [1] 0.08205128
## [1] 0.9641026
## [1] 0.3333333
par(mfrow=c(2,2))
#plots of the confidence intervals versus ordered by the prediction, with the left out data added as points
df = data.frame(apps = College.test$Apps,
pred = predict(apps.lm0,College.test[,-17],type="response"),
lwr = pi.lm[,1], upr=pi.lm[,2])
df = df %>% arrange(pred) # sort by prediction
gp = ggplot(df, aes(x=pred, y=apps)) +
geom_ribbon(aes(ymin = lwr, ymax = upr),
fill = "blue", alpha = 0.2) +
geom_point(aes(y=apps)) +
xlab("predicted value") +
ylab("Apps") +
ggtitle("95% Prediction Intervals under linear regression Model(apps.lm0)")
gp
# plot for app_best
df = data.frame(apps = College.test$Apps,
pred = exp(predict(app_best,College.test[,-17])),
lwr = pi.best[,1], upr=pi.best[,2])
df = df %>% arrange(pred) # sort by prediction
gp = ggplot(df, aes(x=pred, y=apps)) +
geom_ribbon(aes(ymin = lwr, ymax = upr),
fill = "blue", alpha = 0.2) +
geom_point(aes(y=apps)) +
xlab("predicted value") +
ylab("Apps") +
ggtitle("95% Prediction Intervals under linear regression Model(app_best)")
gp
#plot for Apps.poi
df = data.frame(apps = College.test$Apps,
pred = predict(Apps.poi,College.test[,-17],type="response"),
lwr = pi.poi[,1], upr=pi.poi[,2])
df = df %>% arrange(pred) # sort by prediction
gp = ggplot(df, aes(x=pred, y=apps)) +
geom_ribbon(aes(ymin = lwr, ymax = upr),
fill = "blue", alpha = 0.2) +
geom_point(aes(y=apps)) +
xlab("predicted value") +
ylab("Apps") +
ggtitle("95% Prediction Intervals under Poisson regression Model(Apps.poi)")
gp
#plot for Apps.qpoi.2
df = data.frame(apps = College.test$Apps,
pred = predict(Apps.qpoi.2,College.test[,-17],type="response"),
lwr = pi.qpoi.2[,1], upr=pi.qpoi.2[,2])
df = df %>% arrange(pred) # sort by prediction
gp = ggplot(df, aes(x=pred, y=apps)) +
geom_ribbon(aes(ymin = lwr, ymax = upr),
fill = "blue", alpha = 0.2) +
geom_point(aes(y=apps)) +
xlab("predicted value") +
ylab("Apps") +
ggtitle("95% Prediction Intervals under QuasiPoisson regression Model(Apps.qpoi.2)")
gp
#plot for Apps.nb
df = data.frame(apps = College.test$Apps,
pred = predict.glm(Apps.nb,College.test[,-17],type="response"),
lwr = pi.nb[,1], upr=pi.nb[,2])
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : calling predict.lm(<fake-lm-object>) ...
df = df %>% arrange(pred) # sort by prediction
gp = ggplot(df, aes(x=pred, y=apps)) +
geom_ribbon(aes(ymin = lwr, ymax = upr),
fill = "blue", alpha = 0.2) +
geom_point(aes(y=apps)) +
xlab("predicted value") +
ylab("Apps") +
ggtitle("95% Prediction Intervals under Negative Binomial Model(Apps.nb)")
gp
College.test %>%
filter(Apps>40000)%>%
select(college)
## college
## 1 Rutgers at New Brunswick
From the plots, it appears that only the first two, which are both linear regression models, indicate good coverage. In addition, the quasipoisson model also provides a very good coverage. The other two glm models seem to provide bad coverage for prediction, probably due to overfitting. From the plot, it can be observed that one point(Rutgers at New Brunswick) has unusually high y value, and the models seems to predict poorly on that outlier point.
RMSE_obs = c(RMSE.1,RMSE.1.bestlm,rmse.train.poi,rmse.qpoi.train,rmse.train.nb)
RMSE_test = c(RMSE.0.test, RMSE.best.test,rmse.test, rmse.qpoi.test,rmse.test.nb)
coverage = c(a,b,c,d,e)
pvalue = c(pval.lm,pval.best,pval.poi,pval.train,pval.nb)
row =c("apps.lm0","app_best","Apps.poi","Apps.qpoi.2","Apps.nb")
out = data.frame(row.names = row,RMSE_obs,RMSE_test,coverage,pvalue)
kable(out)
| RMSE_obs | RMSE_test | coverage | pvalue | |
|---|---|---|---|---|
| apps.lm0 | 1441.111 | 2894.803 | 0.9589744 | 0.315 |
| app_best | 1244.907 | 2781.898 | 0.9487179 | 0.386 |
| Apps.poi | 4550.667 | 5787.812 | 0.0820513 | 0.000 |
| Apps.qpoi.2 | 4550.624 | 5787.781 | 0.9641026 | 0.668 |
| Apps.nb | 4550.613 | 5787.761 | 0.3333333 | 0.040 |
I think app_best, which is the linear regression model with transformations and interactions added. From the table, app_best’s RMSE on the observed data and on the test data are relatively small, compared to the other models. It also has a great coverage of 0.94. Predictive check p-value is also bigger than 0.05, and it appears to be a good fit.
kable() or xtable()) of relative risks and 95% confidence intervals. Pick 5 of the most important variables and provide a paragraph that provides an interpretation of the parameters (and intervals) that can be provided to a university admissions officer about which variables increase admissions.summary(app_best)
##
## Call:
## lm(formula = log(Apps) ~ log(Top10perc) + Top25perc + log(F.Undergrad) +
## log(P.Undergrad) + Outstate + Room.Board + log(PhD) + Elite,
## data = College.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.52512 -0.23310 0.02069 0.22267 2.09038
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.086e+00 2.559e-01 -4.244 2.56e-05 ***
## log(Top10perc) -1.732e-01 5.089e-02 -3.404 0.00071 ***
## Top25perc 5.779e-03 1.881e-03 3.073 0.00222 **
## log(F.Undergrad) 9.778e-01 2.288e-02 42.736 < 2e-16 ***
## log(P.Undergrad) -6.152e-02 1.422e-02 -4.327 1.79e-05 ***
## Outstate 3.134e-05 6.572e-06 4.769 2.35e-06 ***
## Room.Board 9.707e-05 2.051e-05 4.734 2.78e-06 ***
## log(PhD) 1.986e-01 6.711e-02 2.959 0.00321 **
## EliteYes 1.828e-01 6.830e-02 2.677 0.00765 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3802 on 573 degrees of freedom
## Multiple R-squared: 0.8734, Adjusted R-squared: 0.8716
## F-statistic: 493.9 on 8 and 573 DF, p-value: < 2.2e-16
#since our best model is linear regression model, relative risk doesn't make much sense here, so we just
#provide the coefficient estimates and the confidence intervals.
out = as.data.frame(cbind(app_best$coefficients,confint(app_best)))
kable(out,col.names = c("coef estimates","2.5% CI","97.5% CI"))
| coef estimates | 2.5% CI | 97.5% CI | |
|---|---|---|---|
| (Intercept) | -1.0858859 | -1.5884221 | -0.5833496 |
| log(Top10perc) | -0.1732362 | -0.2731933 | -0.0732791 |
| Top25perc | 0.0057795 | 0.0020853 | 0.0094736 |
| log(F.Undergrad) | 0.9777667 | 0.9328289 | 1.0227045 |
| log(P.Undergrad) | -0.0615160 | -0.0894422 | -0.0335898 |
| Outstate | 0.0000313 | 0.0000184 | 0.0000443 |
| Room.Board | 0.0000971 | 0.0000568 | 0.0001374 |
| log(PhD) | 0.1985791 | 0.0667642 | 0.3303940 |
| EliteYes | 0.1828149 | 0.0486669 | 0.3169628 |
#pick 5 of the most important variables and their confidence intervals
cbind(app_best$coefficients[5:9],confint(app_best,c("log(F.Undergrad)","log(P.Undergrad)","Outstate","Room.Board","log(PhD)")))
## 2.5 % 97.5 %
## log(P.Undergrad) -6.151598e-02 9.328289e-01 1.022705e+00
## Outstate 3.134350e-05 -8.944216e-02 -3.358980e-02
## Room.Board 9.707301e-05 1.843528e-05 4.425172e-05
## log(PhD) 1.985791e-01 5.679444e-05 1.373516e-04
## EliteYes 1.828149e-01 6.676415e-02 3.303940e-01
ANSWER:
The five most important variables coefficient estimates are listed in the table above, along with their confidence intervals. The number of full time undergraduates, percentage of facaulty with Phd’s seem to have a positive effect on application, while the number of part time undergraduates and out of state tuition seem to have a negative effect on the number of applications.From the model, Room and board costs seems to have a positive relationship, which appears to contradict with our common sense. However, considering the fact that private and prestigious schools usually have high cost of room and board, it makes sense that higher room and board costs have a positive effect on the number of applications.
ANSWER: The residual deviance has mean \(n-p-1\) and standard deviation \(\sqrt{2(n-p-1)}\) (basic results from distribution theory). By definition of a \(\Chi^2\), the residual deviance is equal in distribution to \(\sum_{i=1}^{n-p-1} Z_i^2\) where \(Z_i\) are i.i.d. standard normals. By virtue of the CLT, we have for large residual degrees of freedom \(df=n-p-1\), \[ \sqrt{n-p-1} \left( \frac{\sum_{i=1}^{n-p-1} Z_i^2}{n-p-1}-1\right) \approx \mathcal{N}(0, 2).\] Therefore, we have that the residual deviance \(D\) divided by \(df\) satisfies \[\frac{D}{df}-1 \approx \mathcal{N}(0, \frac{2}{df})\] Using the Empirical Rule (95% confidence interval approximated by \(\pm 2\sigma\) for normal), we derive a heuristic cutoff of \[ \frac{D}{f}> 1+ 2 \sqrt{\frac{2}{df}}=1+ \sqrt{\frac{8}{df}}.\]